The simple linear regression model can be easily extended to a multiple linear regression model, where we have more than one predictor variable.
There are $p$ predictor variables $X_1, X_2, \ldots, X_p$ and we assume the following model: $$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_p X_p + \epsilon$$
Where $\epsilon \sim N(0,1)$.
To check whether the model fits well and the assumptions of normality and homoscedasticity hold, it is helpful to look at the plot of residuals vs. fitted values. Let's look at the following examples, where we assume the simple linear model $Y = \beta_0 + \beta_1X + \epsilon$.
# Residuals vs. Fitted
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

x = np.linspace(0, 12, 50)
error = norm.rvs(size=len(x), scale=5)
y = [None] * 3
y[0] = 12 + 5 * x + error               # good example: linear, constant variance
y[1] = 10 + 1.5 * x + 2 * x**2 + error  # quadratic relationship
y[2] = 10 + x + error * x               # heteroscedasticity: variance grows with x
mods = [sm.OLS(y[i], sm.add_constant(x)).fit() for i in range(3)]
$R^2$ is a statistic that will give some information about the goodness of fit of a model. It is the proportion of the variation in the dependent variable that is predictable from the independent variable(s).
An $R^2$ of 1 indicates that the regression predictions perfectly fit the data.
$$R^2 = \frac{\text{RegrSS}}{\text{TotalSS}} = 1 - \frac{\text{ResidSS}}{\text{TotalSS}}$$
$$\text{TotalSS} = \sum_i(y_i - \bar{y})^2$$
$$\text{ResidSS} = \sum_i(y_i - f_i)^2$$
$$\text{RegrSS} = \sum_i(f_i - \bar{y})^2$$
for i in range(3):
    print("R-squared of model %d : %f" % (i + 1, mods[i].rsquared))
R-squared of model 1 : 0.885116
R-squared of model 2 : 0.930183
R-squared of model 3 : 0.010577
Suppose we observe the values $y_1, \ldots, y_t$ sequentially in time. In this scenario this is the only data that we have; there are no other "predictor" variables. Can we say something about the value of $y_{t+1}$ given all the past observations $y_1, \ldots, y_t$?
The autoregressive model specifies that the output variable depends linearly on its own previous values:
$$ Y_t = \beta_0 + \beta_1Y_{t-1} + \beta_2Y_{t-2} + \ldots + \beta_pY_{t-p} + \epsilon_t$$ where $\epsilon_t \overset{\text{i.i.d.}}{\sim} N(0,1)$ is a white noise sequence.
R-squared: 0.615
The moving-average model specifies that the output variable depends linearly on the current and past values of the white noise terms:
$$Y_t = \mu + \theta_1\epsilon_{t-1} + \theta_2\epsilon_{t-2} + \ldots + \theta_q\epsilon_{t-q} + \epsilon_t$$ where $\mu$ is the mean of the series and $\epsilon_t$ a white noise term.
Rather than using past values of the forecast variable in a regression, a moving average model uses past forecast errors in a regression-like model. We do not observe the values of
$\epsilon_t$, so it is not really a regression in the usual sense.
# Example moving average series
import numpy as np
from scipy.stats import norm

N = 60
errors = norm.rvs(size=N, scale=2)
mu = 10
y1 = np.full(N, mu, dtype=float)  # MA(1)
y3 = np.full(N, mu, dtype=float)  # MA(3)
# Start at t = 3 so that errors[t - 3] never wraps around the array
for t in range(3, N):
    y1[t] = mu + 1.5 * errors[t - 1] + errors[t]
    y3[t] = mu + 1 * errors[t - 3] + 1.5 * errors[t - 2] + \
        2 * errors[t - 1] + errors[t]
When working with $AR(p)$ and $MA(q)$ models we assume that the time series $Y_t$ is (weakly) stationary, i.e. its mean and variance are constant over time and the autocovariance between $Y_t$ and $Y_{t+k}$ depends only on the lag $k$.
In practice, this means that no trend or seasonality is present in our time series. We will deal with those later. See below for examples of stationary and non-stationary time series.
import numpy as np
from scipy.stats import norm

N = 150
errors = norm.rvs(size=N, scale=5)
mu = 1
y = np.zeros((N, 4))
for t in range(1, N):
    y[t, 0] = mu + errors[t]                    # Stationary time series
    y[t, 1] = t / 2 + errors[t]                 # Linear trend
    y[t, 2] = mu + t * errors[t]                # Increasing variance
    y[t, 3] = 15 * np.sin(t / 10) + errors[t]   # Seasonality
$MA(q)$: Autocorrelations are zero for lags $k > q$
$AR(p)$: Partial autocorrelations are zero for lags $k > p$
Suppose that $Y_t$ is a time series with a linear trend, i.e.:
$$Y_t = \mu + \alpha t + Z_t$$ where $Z_t$ is a stationary process. Instead of modelling $Y_t$ we can look at the differenced series $$\tilde{Y_t} = Y_t - Y_{t-1} = \alpha + Z_t - Z_{t-1}$$ Then $\tilde{Y_t}$ is a stationary time series.
To remove a quadratic trend we can difference the series twice.
If a seasonal effect of period $m$ is present, e.g. a 12-month cycle, a weekly cycle etc., we should consider the series differenced at lag $m$: $$\tilde{Y_t} = Y_t - Y_{t-m}$$
Note: The order of difference operations does not matter.
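A quick sketch verifying this on a simulated series with both a linear trend and a 12-period cycle (values are illustrative): the two orders of differencing give the same result.

```python
import numpy as np
import pandas as pd

# Series with a linear trend and a 12-period cycle (illustrative)
rng = np.random.default_rng(4)
t = np.arange(100)
s = pd.Series(t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(size=100))

a = s.diff().diff(12)  # first difference, then seasonal difference
b = s.diff(12).diff()  # seasonal difference, then first difference
print(np.allclose(a.dropna(), b.dropna()))
```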
# Generate series
import numpy as np
import pandas as pd
from scipy.stats import norm

N = 365
errors = norm.rvs(size=N, scale=20)
y1 = np.arange(N) + errors                          # linear trend
y2 = 0.005 * np.power(np.arange(N), 2) + errors     # quadratic trend
y3 = np.sin(np.arange(N) % 30 / 10) * 80 + errors   # 30-day cycle
# Apply differencing
x1 = pd.Series(y1).diff()         # removes the linear trend
x2 = pd.Series(y2).diff().diff()  # removes the quadratic trend
x3 = pd.Series(y3).diff(30)       # removes the 30-day cycle
The ARIMA model captures all the approaches presented so far. Each of these components is explicitly specified in the model as a parameter, using the standard notation $\text{ARIMA}(p,d,q)$:
$p$ is the order of the autoregressive model,
$d$ is the degree of differencing,
$q$ is the order of the moving-average model.
$$Y_t = \mu + \beta_1Y_{t-1} + \ldots + \beta_pY_{t-p} + \theta_1\epsilon_{t-1} + \ldots + \theta_q\epsilon_{t-q} + \epsilon_t$$
Select the model:
Try several candidate models and compare fit statistics to choose the best one,
or use an automatic selection algorithm to find the best-fitting ARIMA model.
df = pd.read_csv("data/airpassangers.csv", parse_dates=True, index_col=0)
df.columns = ["passangers"]
df["passangers"].plot()
plt.title("Airline Passengers")
plt.show()
np.log(df["passangers"]).plot()
plt.title("Log-transformed Airline Passengers")
plt.show()
log_passangers = np.log(df["passangers"])
y = log_passangers.diff(12)[12:]
plot_acf_pacf(y)
y.plot()
plt.title("Log-transformed and differenced at lag 12 time series")
plt.show()
# Compare our selection with the auto_arima
auto_model = pmd.auto_arima(y, trace=False, seasonal=False)
auto_model
ARIMA(order=(2, 0, 0), scoring_args={}, suppress_warnings=True)
# Fit the selected model
model_fit = ARIMA(y, order=(2, 0, 0)).fit()
results = model_fit.get_prediction()
conf_int = results.conf_int(alpha=0.05).values
fitted = results.predicted_mean
model_fit.summary()
| Dep. Variable: | passangers | No. Observations: | 132 |
|---|---|---|---|
| Model: | ARIMA(2, 0, 0) | Log Likelihood | 233.131 |
| Date: | Thu, 20 Jan 2022 | AIC | -458.262 |
| Time: | 19:58:23 | BIC | -446.731 |
| Sample: | 01-01-1950 - 12-01-1960 | HQIC | -453.577 |
| Covariance Type: | opg | | |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 0.1150 | 0.017 | 6.926 | 0.000 | 0.082 | 0.148 |
| ar.L1 | 0.5540 | 0.074 | 7.515 | 0.000 | 0.410 | 0.698 |
| ar.L2 | 0.2378 | 0.073 | 3.252 | 0.001 | 0.095 | 0.381 |
| sigma2 | 0.0017 | 0.000 | 9.762 | 0.000 | 0.001 | 0.002 |
| Ljung-Box (L1) (Q): | 0.01 | Jarque-Bera (JB): | 7.78 |
|---|---|---|---|
| Prob(Q): | 0.94 | Prob(JB): | 0.02 |
| Heteroskedasticity (H): | 0.40 | Skew: | 0.33 |
| Prob(H) (two-sided): | 0.00 | Kurtosis: | 3.99 |
model_fit.plot_diagnostics(figsize=(15, 8))
plt.show()
It's now time to practice and test what we've learnt in time series modelling by developing a couple of algorithmic trading strategies! Our plan of action will be:
# Retrieving stock price data
import pandas_datareader.data as web

stocks = ['GOOG', 'AAPL', 'FB', 'AMZN', 'GE', 'NFLX', 'JPM']
data = web.DataReader(stocks, 'yahoo', start='2020/11/10', end='2021/11/10')
data.head()
# Select Google closing data
prices = data['Adj Close']['GOOG']
fig, ax = plt.subplots()
ax.plot(prices)
ax.set_title("GOOG Stock Price")
ax.set_xlabel("Date")
ax.set_ylabel("Price ($)")
plt.show()
plot_acf_pacf(prices)
# Get stock returns
returns = prices.pct_change().copy()
fig, ax = plt.subplots()
ax.plot(returns)
ax.set_title("GOOG Stock Returns")
ax.set_xlabel("Date")
ax.set_ylabel("Change in Price (%)")
plt.show()
plot_acf_pacf(returns[1:])
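`pct_change` above computes simple returns $r_t = (P_t - P_{t-1})/P_{t-1}$; a minimal sketch confirming the equivalence on some illustrative prices:

```python
import numpy as np
import pandas as pd

# pct_change computes simple returns: r_t = (P_t - P_{t-1}) / P_{t-1}
prices_demo = pd.Series([100.0, 102.0, 99.96, 101.0])  # illustrative prices
returns_demo = prices_demo.pct_change()
manual = (prices_demo - prices_demo.shift(1)) / prices_demo.shift(1)
print(np.allclose(returns_demo.dropna(), manual.dropna()))
```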
# Compare with the auto_arima
auto_model = pmd.auto_arima(prices, trace=True, stepwise=False)
def arima_trading_backtest(prices, order, buy_ret=0.01, sell_ret=-0.01,
                           train_size=0.3, verbose=True):
    # Preparing returns data
    p, d, q = order
    returns = prices.pct_change()[1:].copy()
    prices = prices[1:]
    # Train-test split and initial model training
    N = returns.shape[0]
    train = returns.head(round(N * train_size))
    test = returns.tail(N - round(N * train_size))
    # Returns are already differenced once, hence order (p, d-1, q)
    model = ARIMA(train, order=(p, d - 1, q)).fit()
    # Trading simulation
    holding = False
    balance = 100
    preds = []
    trades = []
    for t in test.index:
        i = prices.index.get_loc(t)
        # One-step-ahead forecast of the return at time t
        pred = model.forecast().iloc[0]
        preds.append(pred)
        # Buying and selling conditions
        trade = False
        if (not holding) and (pred > buy_ret):
            trade = True
            holding = True
            trades.append(('buy', prices.index[i - 1]))
            buy_price = prices.iloc[i - 1]
        elif holding and (pred < sell_ret):
            trade = True
            holding = False
            sell_price = prices.iloc[i - 1]
            profit = (sell_price - buy_price) / buy_price
            trades.append(('sell+' if profit > 0 else 'sell-', prices.index[i - 1]))
            balance += balance * profit
        # Verbose prints
        if trade and verbose:
            if not holding:
                print(f'Sold @ ${sell_price}, {prices.index[i - 1]}')
                print(f'Overall trade gain: {100 * profit}%')
            else:
                print(f'Bought @ ${buy_price}, {prices.index[i - 1]}')
            print(f'Predicted return: {pred}')
            print(f'Actual return: {returns.loc[t]}')
            print('')
        # Re-train model on all returns observed up to time t
        model = ARIMA(returns[:t], order=(p, d - 1, q)).fit()
    # Present value of portfolio if still holding
    if holding:
        sell_price = prices.loc[t]
        profit = (sell_price - buy_price) / buy_price
        trades.append(('sell+', t) if profit > 0 else ('sell-', t))
        balance += balance * profit
    print(f'Overall Balance: ${balance}')
    return pd.Series(preds, index=test.index), trades
preds, trades = arima_trading_backtest(prices, order=(1,1,0))
preds, trades = arima_trading_backtest(prices, order=(0,1,5))
# Plotting prices
fig, ax = plt.subplots()
ax.plot(prices)
for trade, t in trades:
    plt.axvline(x=t, color='g' if trade == 'sell+' else 'r' if trade == 'sell-' else 'k',
                linestyle='--')
ax.set_title("GOOG Stock Price: ARIMA Algorithmic Trading Backtest")
ax.set_xlabel("Date")
ax.set_ylabel("Price ($)")
plt.show()
# Plotting returns and predictions
fig, ax = plt.subplots()
ax.plot(returns[1:])
ax.plot(preds)
for trade, t in trades:
    plt.axvline(x=t, color='g' if trade == 'sell+' else 'r' if trade == 'sell-' else 'k',
                linestyle='--')
ax.set_title("GOOG Stock Returns: ARIMA Algorithmic Trading Backtest")
ax.set_xlabel("Date")
ax.set_ylabel("Change in Price (%)")
plt.show()
for ticker in stocks[1:]:
    prices = data['Adj Close'][ticker]
    print(ticker)
    arima_trading_backtest(prices, order=(0, 1, 5), verbose=False)
    print('')